Greedy Minimization of Weakly Supermodular Set Functions

Abstract

This paper defines weak-$\alpha$-supermodularity for set functions. Many optimization objectives in machine learning
and data mining seek to minimize such functions under cardinality constraints. We prove that such problems benefit
from a greedy extension phase. Explicitly, let $S^*$ be the optimal set of cardinality $k$ that minimizes $f$ and let $S_0$ be
an initial solution such that $f(S_0)/f(S^*) \le \rho$. Then, a greedy extension $S \supseteq S_0$ of size
$|S| \le |S_0| + \lceil \alpha k \ln(\rho/\varepsilon) \rceil$ yields $f(S)/f(S^*) \le 1 + \varepsilon$. As example usages of this
framework we give new bicriteria results for $k$-means, sparse regression, and column subset selection.


1 Introduction 

Many problems in data mining and unsupervised machine learning take the form of minimizing set functions with
cardinality constraints. More explicitly, denote by $[n]$ the set $\{1, \ldots, n\}$ and let $f : 2^{[n]} \to \mathbb{R}_+$.
Our goal is to minimize $f(S)$ subject to $|S| \le k$. These problems include clustering and covering problems as well
as regression, matrix approximation problems and many others. These combinatorial problems are hard to solve in
general. Finding good (e.g. constant factor) approximate solutions for them requires significant sophistication and
highly specialized algorithms.

In this paper we analyze the behavior of the greedy algorithm on all these problems. We start by claiming that the
functions above are special. A trivial observation is that they are non-negative and non-increasing, that is,
$f(S \cup T) \le f(S)$ for any $S, T \subseteq [n]$. This immediately shows that expanding solution sets is (at least
potentially) beneficial in terms of reducing the function value. But monotonicity is not enough to ensure that any
number of greedy extensions of a given solution would reduce the objective function.

To this end we need to somehow quantify the gain of adding a single element (greedily) to a solution set. Let
$f(S) - f(S \cup T)$ be the reduction in $f$ one gains by adding a set of elements $T$ to the current solution $S$. Then, the
average gain of adding elements from $T$ sequentially is $[f(S) - f(S \cup T)]/|T \setminus S|$. One would hope that there
exists an element $i \in T \setminus S$ such that $f(S) - f(S \cup \{i\}) \ge [f(S) - f(S \cup T)]/|T \setminus S|$, but that
would be false, in general, since the different element contributions are not independent of each other. Lemma 1,
however, shows that this is true for supermodular functions.

Definition 1. A set function $f : 2^{[n]} \to \mathbb{R}_+$ is said to be supermodular if for any two sets $S, T \subseteq [n]$

$$f(S \cap T) + f(S \cup T) \ge f(S) + f(T). \tag{1}$$

Combining this fact with the idea that $T$ could be any set, including the optimal solution $S^*$, already gives some
useful results for minimizing supermodular set functions, specifically those for which $f(S^*)$ is bounded away from
zero. Notice that $k$-means is exactly this kind of problem. Section 4 gives some new bicriteria results obtainable for
$k$-means via the greedy extension algorithm of Section 3. A similar intuition gives a very famous result that the greedy
algorithm provides a $(1 - 1/e)$-factor approximation for maximizing set functions $g(S)$ subject to $|S| \le k$ if $g$ is
positive, monotone non-decreasing and submodular [?].

Alas, most problems of interest, such as regression, column subset selection, feature selection, and outlier
detection (and many others) are not supermodular. In Section 2 we define the notion of weak-$\alpha$-supermodularity.
Intuitively, weak-$\alpha$-supermodular functions are those conducive to greedy type algorithms. Or, alternatively, the
inequality above holds up to some constant $\alpha \ge 1$. Weak-$\alpha$-supermodularity requires that there exists an element
$i \in T \setminus S$ such that adding $i$ first gains at least $[f(S) - f(S \cup T)]/(\alpha |T \setminus S|)$ for some $\alpha \ge 1$.

As an example for this framework we show in Section 5 that Sparse Multiple Linear Regression (SMLR) is
weak-$\alpha$-supermodular. Using this fact we extend (and slightly improve) the result of [?] for Sparse Regression and
obtain new bicriteria results for Column Subset Selection.


2 Weakly Supermodular Set Functions 

In this section, we define our notation and the notion of weak-$\alpha$-supermodularity. Throughout the manuscript we
denote by $[n]$ the set $\{1, \ldots, n\}$. We concern ourselves with non-negative set functions
$f : 2^{[n]} \to \mathbb{R}_+$. More specifically, we consider monotone non-increasing set functions such that
$f(S) \ge f(S \cup T)$ for any two sets $S \subseteq [n]$ and $T \subseteq [n]$.

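Definition 2. A non-negative non-increasing set function $f : 2^{[n]} \to \mathbb{R}_+$ is said to be
weakly-$\alpha$-supermodular if there exists $\alpha \ge 1$ such that for any two sets $S, T \subseteq [n]$

$$f(S) - f(S \cup T) \le \alpha \cdot |T \setminus S| \cdot \max_{i \in T \setminus S} \left[ f(S) - f(S \cup \{i\}) \right]. \tag{2}$$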

This property is useful because we will later try to minimize $f$. It asserts that if adding $T \setminus S$ is beneficial then
there is an element $i \in T \setminus S$ that contributes at least a fraction of that gain. The reason for the name of this
property is explained by the following lemma.

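Lemma 1. A non-increasing non-negative supermodular function $f$ is weakly-$\alpha$-supermodular with parameter $\alpha = 1$.

Proof. For $S, T \subseteq [n]$ order the set $T \setminus S$ in an arbitrary order, i.e. $T \setminus S = \{i_1, \ldots, i_{|T \setminus S|}\}$.
Define $R_0 = \emptyset$ and $R_t = \{i_1, \ldots, i_t\}$ for $t > 0$. By supermodularity we have for any $t$

$$f(S) - f(S \cup \{i_t\}) \ge f(S \cup R_{t-1}) - f(S \cup R_{t-1} \cup \{i_t\}). \tag{3}$$

We note that $R_{t-1} \cup \{i_t\} = R_t$ and sum up Equation (3) over $t$:

$$\sum_{t=1}^{|T \setminus S|} \left[ f(S) - f(S \cup \{i_t\}) \right] \ge \sum_{t=1}^{|T \setminus S|} \left[ f(S \cup R_{t-1}) - f(S \cup R_{t-1} \cup \{i_t\}) \right] = f(S) - f(S \cup T).$$

Since $|T \setminus S| \cdot \max_{i \in T \setminus S} [f(S) - f(S \cup \{i\})] \ge \sum_{t=1}^{|T \setminus S|} [f(S) - f(S \cup \{i_t\})]$,
this implies weak-1-supermodularity. □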
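As a sanity check on Definition 2 and Lemma 1, the following Python sketch computes by brute force the smallest $\alpha$ for which inequality (2) holds on a tiny ground set. The helper names (`subsets`, `weak_alpha`, `is_supermodular`), the cost table `W` and the cap `M` are illustrative assumptions and are not part of the paper. The toy objective is a $k$-median-style cost with an implicit dummy center of cost `M`, so that $f(\emptyset)$ is finite.

```python
from itertools import combinations

def subsets(n):
    """Yield every subset of {0, ..., n-1} as a frozenset."""
    for r in range(n + 1):
        for c in combinations(range(n), r):
            yield frozenset(c)

def is_supermodular(f, n, tol=1e-12):
    """Check inequality (1) for every pair of subsets."""
    all_sets = list(subsets(n))
    return all(f(S & T) + f(S | T) >= f(S) + f(T) - tol
               for S in all_sets for T in all_sets)

def weak_alpha(f, n, tol=1e-12):
    """Smallest alpha for which inequality (2) holds for all S, T.

    Returns float('inf') if adding T helps but no single element of
    T minus S does.
    """
    alpha = 0.0
    for S in subsets(n):
        for T in subsets(n):
            diff = T - S
            if not diff:
                continue
            joint = f(S) - f(S | T)
            if joint <= tol:      # inequality (2) holds trivially
                continue
            best = max(f(S) - f(S | {i}) for i in diff)
            if best <= tol:
                return float("inf")
            alpha = max(alpha, joint / (len(diff) * best))
    return alpha

# Hypothetical costs: W[i][j] is the cost of serving point i from center j.
W = [[1.0, 4.0, 6.0], [5.0, 2.0, 3.0], [7.0, 6.0, 1.0], [2.0, 2.0, 8.0]]
M = 10.0  # cost of the implicit dummy center (assumption)

def f(S):
    return sum(min([M] + [row[j] for j in S]) for row in W)

print(is_supermodular(f, 3), weak_alpha(f, 3))
```

For a supermodular function the computed constant should not exceed 1, in agreement with Lemma 1. The enumeration is exponential in $n$ and is only meant for very small examples.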

3 Greedy Extension Algorithm 

We are given a non-increasing weakly-$\alpha$-supermodular set function $f(S)$ and would like to solve the following
optimization problem

$$\min\{f(S) : |S| \le k\}. \tag{4}$$

Consider a simple greedy algorithm that starts with some initial solution $S_0$ of value $f(S_0)$ (maybe $S_0 = \emptyset$) and
sequentially and greedily adds elements to it to minimize $f$.

Theorem 1. Let $S_\tau$ be the output of Algorithm 1. Then $|S_\tau| \le |S_0| + \lceil \alpha k \ln(f(S_0)/E) \rceil$ and
$f(S_\tau) \le f(S^*) + E$, where $S^*$ is an optimal solution of the optimization problem (4).

Algorithm 1 Greedy Extension Algorithm

input: Weakly-$\alpha$-supermodular function $f$, $S_0$, $k$, $E$
for $t = 1, \ldots, \tau = \lceil \alpha k \ln(f(S_0)/E) \rceil$ do
    $S_t \leftarrow S_{t-1} \cup \arg\min_{i \in [n]} f(S_{t-1} \cup \{i\})$

output: $S_\tau$


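Proof. The fact that $|S_\tau| \le |S_0| + \lceil \alpha k \ln(f(S_0)/E) \rceil$ is a trivial observation. For the second claim
consider an arbitrary iteration $t \in [\tau]$ and consider the set $S^* \setminus S_{t-1}$. By monotonicity and weak
$\alpha$-supermodularity

$$f(S_{t-1}) - f(S^*) \le f(S_{t-1}) - f(S_{t-1} \cup S^*) \le \alpha k \cdot \max_{i \in S^* \setminus S_{t-1}} \left[ f(S_{t-1}) - f(S_{t-1} \cup \{i\}) \right]$$

$$\le \alpha k \cdot \max_{i \in [n]} \left[ f(S_{t-1}) - f(S_{t-1} \cup \{i\}) \right] = \alpha k \cdot \left( f(S_{t-1}) - f(S_t) \right).$$

By rearranging the above equation and recursing over $t$ we get

$$f(S_t) - f(S^*) \le \left( f(S_{t-1}) - f(S^*) \right) (1 - 1/\alpha k) \le \left( f(S_0) - f(S^*) \right) (1 - 1/\alpha k)^t.$$

Substituting $t = \tau = \lceil \alpha k \ln(f(S_0)/E) \rceil$ for the last step of the algorithm completes the proof:

$$f(S_\tau) - f(S^*) \le \left( f(S_0) - f(S^*) \right) (1 - 1/\alpha k)^{\lceil \alpha k \ln(f(S_0)/E) \rceil} \le \left( f(S_0) - f(S^*) \right) e^{-\ln(f(S_0)/E)} \le E.$$

□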
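The greedy extension and the guarantee of Theorem 1 can be illustrated with a short Python sketch. The function name `greedy_extension`, the cost table `W`, the cap `M` and the parameter values are hypothetical choices made for this example. The toy objective is supermodular, so $\alpha = 1$ is assumed, and the optimum is found by enumeration only because the instance is tiny.

```python
import math
from itertools import combinations

def greedy_extension(f, n, S0, alpha, k, E):
    """Sketch of Algorithm 1: ceil(alpha * k * ln(f(S0) / E)) greedy steps."""
    S = set(S0)
    if f(S) <= E:  # already within E of any solution, no steps needed
        return S
    tau = math.ceil(alpha * k * math.log(f(S) / E))
    for _ in range(tau):
        # add the element whose inclusion yields the smallest objective value
        best = min(range(n), key=lambda i: f(S | {i}))
        S.add(best)
    return S

# Hypothetical k-median-style costs with a dummy center of cost M.
W = [[1.0, 4.0, 6.0], [5.0, 2.0, 3.0], [7.0, 6.0, 1.0], [2.0, 2.0, 8.0]]
M = 10.0

def f(S):
    return sum(min([M] + [row[j] for j in S]) for row in W)

n, k, alpha, E = 3, 1, 1.0, 2.0
# brute-force optimum over all sets of size at most k
opt = min(f(set(c)) for r in range(k + 1) for c in combinations(range(n), r))
S = greedy_extension(f, n, set(), alpha, k, E)
print(sorted(S), f(S), opt)
```

Each step evaluates $f$ on $n$ candidate sets, so the sketch makes $n \tau$ calls to $f$ in total.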

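Theorem 2. Assume there exists a $\rho$-approximation algorithm creating $S_0$ such that $f(S_0) \le \rho f(S^*)$. There exists an
algorithm for generating $S$ such that $|S| \le |S_0| + \lceil \alpha k \ln(\rho/\varepsilon) \rceil$ and $f(S) \le (1 + \varepsilon) f(S^*)$.

Proof. Use the $\rho$-approximation algorithm to create $S_0$ for Algorithm 1 and set $E = \varepsilon f(S_0)/\rho$. Then
$\ln(f(S_0)/E) = \ln(\rho/\varepsilon)$ and $E \le \varepsilon f(S^*)$, so both claims follow from Theorem 1. □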
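To get a feeling for the size of the extension in Theorem 2, the following small Python helper evaluates $\lceil \alpha k \ln(\rho/\varepsilon) \rceil$. The function name `extension_size` and the sample parameter values are assumptions chosen for illustration.

```python
import math

def extension_size(alpha, k, rho, eps):
    """Number of greedy steps ceil(alpha * k * ln(rho / eps)) from Theorem 2."""
    return math.ceil(alpha * k * math.log(rho / eps))

# Example: a 20-approximate start, target accuracy 0.1, k = 10, alpha = 1.
print(extension_size(alpha=1.0, k=10, rho=20.0, eps=0.1))
```

The dependence on $\rho$ and $\varepsilon$ is only logarithmic, so even a crude initial solution costs few additional elements.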


Algorithm 2 Greedy Extension Algorithm; an alternative stopping criterion

input: Weakly-$\alpha$-supermodular function $f$, $S_0$, $f_{\mathrm{stop}}$
repeat
    $S_t \leftarrow S_{t-1} \cup \arg\min_{i} f(S_{t-1} \cup \{i\})$
until $f(S_t) \le f_{\mathrm{stop}}$

output: $S = S_t$


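Theorem 3. For a target value $\bar{f}$, let $k_{\bar{f}}$ be the minimal cardinality of a set $S'$ such that $f(S') \le \bar{f}$.
For any $f_{\mathrm{stop}}$ such that $\bar{f} < f_{\mathrm{stop}}$, Algorithm 2 outputs $S$ such that

$$|S| \le |S_0| + \left\lceil \alpha k_{\bar{f}} \ln\left( \frac{f(S_0)}{f_{\mathrm{stop}} - \bar{f}} \right) \right\rceil.$$

Proof. The proof follows from Theorem 1 by setting $k = k_{\bar{f}}$ and $E = f_{\mathrm{stop}} - \bar{f}$. □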
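The alternative stopping criterion can be sketched in Python as follows. The function name `greedy_until` and the toy data are hypothetical. Unlike the repeat/until form of Algorithm 2, this sketch checks the condition before the first step and also stops once every element has been added, which is an extra safeguard assumed here so that the loop always terminates.

```python
def greedy_until(f, n, S0, f_stop):
    """Sketch of Algorithm 2: add greedy elements until f(S) <= f_stop."""
    S = set(S0)
    while f(S) > f_stop and len(S) < n:
        # best element among those not yet selected
        best = min((i for i in range(n) if i not in S),
                   key=lambda i: f(S | {i}))
        S.add(best)
    return S

# Hypothetical k-median-style costs with a dummy center of cost M.
W = [[1.0, 4.0, 6.0], [5.0, 2.0, 3.0], [7.0, 6.0, 1.0], [2.0, 2.0, 8.0]]
M = 10.0

def f(S):
    return sum(min([M] + [row[j] for j in S]) for row in W)

S = greedy_until(f, 3, set(), f_stop=10.0)
print(sorted(S), f(S))
```

This variant is convenient when a target objective value is known but the cardinality needed to reach it is not.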


4 Clustering 

We will use the following auxiliary problem. 

Definition 3 ($k$-Median). We are given a set $X$ of data points, the set $C$ of potential cluster center locations and the
nonnegative costs $w_{ij} \ge 0$ for all $(i, j) \in X \times C$. Find a set $S \subseteq C$ minimizing
$f(S) = \sum_{i \in X} \min_{j \in S} w_{ij}$ subject to $|S| \le k$.

It is well known that the objective function $f(S)$ of the $k$-Median problem is supermodular and therefore
weakly-1-supermodular by Lemma 1. Our first application is a constrained version of the $k$-means clustering problem.

Definition 4 (Constrained $k$-Means). Given a set of points $X \subset \mathbb{R}^d$ find a set $S \subseteq X$ minimizing
$f(S) = \sum_{x \in X} \min_{c \in S} \|x - c\|^2$ subject to $|S| \le k$.

Lemma 2. Given a set of $n$ points $X \subset \mathbb{R}^d$, let $S^*$ be the optimal solution to the constrained $k$-means
problem. Namely, $S^*$ minimizes $f(S)$ subject to $|S| \le k$. One can find in $O(n^2 d k \log(1/\varepsilon))$ time a set $S$ of
size $|S| = O(k) + k \log(1/\varepsilon)$ such that $f(S) \le (1 + \varepsilon) f(S^*)$.

Proof. The constrained $k$-means objective function $f$ is weakly-1-supermodular because the problem is a special case
of the $k$-Median problem defined above. Using the algorithm of [?] one obtains a set $S_0$ of $|S_0| = O(k)$ points
from the data for which $f(S_0) = O(f(S^*))$. Their technique improves on the analysis of the adaptive sampling method
of [?]. Greedily extending $S_0$ and applying the analysis of Theorem 1 completes the proof. The quadratic dependency
of the running time on the number of data points can be alleviated using the coreset construction of [?]. □
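The greedy extension step used in the proof above can be sketched in Python for the constrained $k$-means objective. The names `sqdist` and `greedy_kmeans_extension`, the sample points and the number of steps are assumptions for illustration; the initial solution is simply taken to be a single data point instead of the output of an approximation algorithm. The sketch caches the squared distance of every point to its nearest chosen center, so evaluating one candidate costs $O(nd)$ and one greedy step costs $O(n^2 d)$.

```python
def sqdist(x, y):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def greedy_kmeans_extension(points, S0, steps):
    """Greedily add data points as centers; returns (center indices, cost)."""
    S = list(S0)
    # d[i] = squared distance from point i to its nearest chosen center
    d = [min(sqdist(x, points[c]) for c in S) for x in points]
    for _ in range(steps):
        def cost_with(c):
            # objective value if data point c were added as a center
            return sum(min(di, sqdist(x, points[c])) for di, x in zip(d, points))
        best = min((c for c in range(len(points)) if c not in S), key=cost_with)
        S.append(best)
        d = [min(di, sqdist(x, points[best])) for di, x in zip(d, points)]
    return S, sum(d)

# Hypothetical data: two tight pairs and one isolated point.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 8)]
S, cost = greedy_kmeans_extension(points, S0=[0], steps=2)
print(S, cost)
```

The sketch requires a non-empty `S0`, since the objective is undefined for an empty set of centers.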


The classical $k$-means clustering problem is defined as follows.

Definition 5 (Unconstrained $k$-Means). Given a set of $n$ points $X \subset \mathbb{R}^d$ find a set $S \subset \mathbb{R}^d$ minimizing
$f(S) = \sum_{x \in X} \min_{c \in S} \|x - c\|^2$ subject to $|S| \le k$.

Lemma 3. Let $f(S^*)$ be the value of the optimal solution to the unconstrained $k$-means problem. One can find in time
$O(n^2 d k \log(1/\varepsilon))$ a set $S \subset \mathbb{R}^d$ of size $|S| = O(k) + k \log(1/\varepsilon)$ such that $f(S) \le (2 + \varepsilon) f(S^*)$.

Proof. The proof and the algorithm are identical to the above. The only point to note is that a $1 + \varepsilon/2$ approximation
to the constrained problem is at most a $2 + \varepsilon$ approximation to the unconstrained one. See [?], for example, for the
argument that the minimum of the constrained objective is at most twice that of the unconstrained one. □


Alternatively, we can utilize a more computationally expensive approach. It is known that given an instance $(X, k)$
of the Unconstrained $k$-Means problem one can construct in polynomial time an instance of the $k$-Median problem
$(X, C, w, k)$ where $C \subset \mathbb{R}^d$ such that for any solution of value $\Phi$ for the Unconstrained $k$-Means problem there
exists a solution of value $(1 + \varepsilon)\Phi$ for the corresponding instance of the $k$-Median problem (see Theorem 7 [?]). Moreover,

$$|C| = n^{O(\log(1/\varepsilon)/\varepsilon^2)}.$$

Therefore, after applying this transformation on our instance of the Unconstrained $k$-Means problem and
using the same initial solution $S_0$ as in Lemma 3, we derive the following.


Lemma 4. Let $f(S^*)$ be the value of the optimal solution to the unconstrained $k$-means problem. One can find in time [...]
a set $S \subset \mathbb{R}^d$ of size $|S| = O(k) + k \log(1/\varepsilon)$ such that $f(S) \le (1 + \varepsilon) f(S^*)$.


5 Sparse Multiple Linear Regression 

We begin by defining the Sparse Multiple Linear Regression (SMLR) problem. Given two matrices $X \in \mathbb{R}^{m \times n}$ and
$Y \in \mathbb{R}^{m \times \ell}$ and an integer $k$, find a matrix $W \in \mathbb{R}^{n \times \ell}$ that minimizes $\|XW - Y\|_F^2$ subject to $W$
having at most $k$ non-zero rows. We assume for notational brevity (and w.l.o.g.) that the columns of $X$ have unit norm.
An alternative and equivalent formulation of SMLR is as follows. Let $X_S$ be the submatrix of $X$ defined by the
columns of $X$ indexed by the set $S \subseteq \{1, \ldots, n\}$. Let $X_S^+$ be the Moore-Penrose pseudo-inverse of $X_S$. It is
well-known (and easy to verify) that the minimum of $\|XW - Y\|_F^2$ over matrices $W$ whose non-zero rows are indexed by
$S$ is equal to $\|Y - X_S X_S^+ Y\|_F^2$. SMLR can therefore be reformulated as

$$\min_{S \subseteq [n]} \left\{ f(S) = \|Y - X_S X_S^+ Y\|_F^2 : |S| \le k \right\}.$$
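The reformulated objective can be evaluated without forming a pseudo-inverse, because $X_S X_S^+$ is the orthogonal projector onto the column span of $X_S$. The following pure-Python sketch does this with Gram-Schmidt orthogonalization. The function name `smlr_objective`, the column-list data layout and the sample matrices are assumptions made for this example, and the Gram-Schmidt route is one possible way to compute the projection rather than a method prescribed by the paper.

```python
def smlr_objective(X_cols, Y_cols, S, tol=1e-12):
    """f(S) = ||Y - X_S X_S^+ Y||_F^2, with X and Y given as lists of columns."""
    # Build an orthonormal basis of span{x_j : j in S} by Gram-Schmidt.
    basis = []
    for j in S:
        v = list(X_cols[j])
        for q in basis:
            c = sum(a * b for a, b in zip(q, v))
            v = [a - c * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > tol:  # skip columns that are linearly dependent
            basis.append([a / norm for a in v])
    # Sum the squared norms of the residuals of the columns of Y.
    total = 0.0
    for y in Y_cols:
        r = list(y)
        for q in basis:
            c = sum(a * b for a, b in zip(q, r))
            r = [a - c * b for a, b in zip(r, q)]
        total += sum(a * a for a in r)
    return total

# Hypothetical instance: three unit-norm columns in R^3, one target column.
r2 = 2 ** -0.5
X_cols = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (r2, r2, 0.0)]
Y_cols = [(3.0, 4.0, 5.0)]
for S in ([], [0], [1], [2], [0, 1]):
    print(S, smlr_objective(X_cols, Y_cols, S))
```

In this instance the single best column is the third one, while the best pair is the first two, which illustrates why element contributions are not independent.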

We can consequently apply our methodology from Section 3 to SMLR if we show that $f(S)$ is $\alpha$-weakly-supermodular.

Lemma 5. For $X \in \mathbb{R}^{m \times n}$ and $Y \in \mathbb{R}^{m \times \ell}$ the SMLR minimization function
$f(S) = \|Y - X_S X_S^+ Y\|_F^2$ is $\alpha$-weakly-supermodular with $\alpha = \max_{S'} \|X_{S'}^+\|_2^2$.

Proof. We first estimate $f(S) - f(S \cup T)$. Denote by $Z_{T \setminus S}$ the matrix whose columns are those of $X_{T \setminus S}$
projected away from the span of $X_S$ and normalized. More formally, $\zeta_i = \|(I - X_S X_S^+) x_i\|$ and
$z_i = (I - X_S X_S^+) x_i / \zeta_i$ for all $i \in T \setminus S$. Note that the column span of $Z_{T \setminus S}$ is orthogonal to that of $X_S$
and that together they are equal to the column span of $X_{T \cup S}$. Using the Pythagorean theorem we obtain
$f(S) = \|Y\|_F^2 - \|X_S X_S^+ Y\|_F^2$ and $f(S \cup T) = \|Y\|_F^2 - \|X_S X_S^+ Y\|_F^2 - \|Z_{T \setminus S} Z_{T \setminus S}^+ Y\|_F^2$.
Substituting $T = \{i\}$ also gives $f(S) - f(S \cup \{i\}) = \|z_i z_i^\top Y\|_F^2 = \|z_i^\top Y\|_2^2$.


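$$
\begin{aligned}
f(S) - f(S \cup T) &= \|Z_{T \setminus S} Z_{T \setminus S}^{+} Y\|_F^2 && (5) \\
&= \|(Z_{T \setminus S}^{\top})^{+} Z_{T \setminus S}^{\top} Y\|_F^2 && \text{by Singular Value Decomposition} \quad (6) \\
&\le \|(Z_{T \setminus S}^{\top})^{+}\|_2^2 \cdot \|Z_{T \setminus S}^{\top} Y\|_F^2 && (7) \\
&= \|Z_{T \setminus S}^{+}\|_2^2 \cdot \sum_{i \in T \setminus S} \|z_i^{\top} Y\|_2^2 && (8) \\
&\le \|X_{T \cup S}^{+}\|_2^2 \cdot |T \setminus S| \max_{i \in T \setminus S} \|z_i^{\top} Y\|_2^2 && \text{see below} \quad (9) \\
&\le \alpha \cdot |T \setminus S| \max_{i \in T \setminus S} \left[ f(S) - f(S \cup \{i\}) \right] && (10)
\end{aligned}
$$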


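For Equation (9) we use a non trivial transition, $\|Z_{T \setminus S}^{+}\|_2 \le \|X_{T \cup S}^{+}\|_2$. By the definition of
$Z_{T \setminus S}$ we can write for $i \in T \setminus S$ that $z_i = (x_i - \sum_{j \in S} a_{ij} x_j)/\zeta_i$, where
$\zeta_i = \|(I - X_S X_S^+) x_i\|$ and $a_{ij}$ are the coordinates of $X_S^+ x_i$. For any vector $w \in \mathbb{R}^{|T \setminus S|}$

$$Z_{T \setminus S} w = \sum_{i \in T \setminus S} x_i w_i/\zeta_i - \sum_{j \in S} x_j \sum_{i \in T \setminus S} w_i a_{ij}/\zeta_i = X_{T \cup S} w'$$

where $w'_i = w_i/\zeta_i$ for $i \in T \setminus S$ and $w'_j = -\sum_{i \in T \setminus S} w_i a_{ij}/\zeta_i$ for $j \in S$. Since
$\zeta_i = \|(I - X_S X_S^+) x_i\| \le \|x_i\| = 1$ we have $\|w'\| \ge \|w\|$. Finally, consider $w$ such that $\|w\| = 1$ and
$\|Z_{T \setminus S} w\| = \|Z_{T \setminus S}^{+}\|^{-1}$. This is the right singular vector corresponding to the smallest singular
value of $Z_{T \setminus S}$. We obtain

$$\|Z_{T \setminus S}^{+}\|^{-1} = \|Z_{T \setminus S} w\| = \|X_{T \cup S} w'\| \ge \|X_{T \cup S}^{+}\|^{-1} \|w'\| \ge \|X_{T \cup S}^{+}\|^{-1},$$

which completes the proof. □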


Lemma 6. Let $f(S^*)$ be the value of the optimal solution to the Sparse Multiple Linear Regression problem. One can find
in time $O(\alpha k \log(\|Y\|_F^2/\varepsilon) \cdot n T_f)$ a set $S \subseteq [n]$ of size $|S| = \lceil \alpha k \log(\|Y\|_F^2/\varepsilon) \rceil$ such
that $f(S) \le (1 + \varepsilon) f(S^*)$, where $T_f$ is the time needed to compute $f(S)$ once.


6 Sparse Regression 


The problem of Sparse Regression defined in [?] is an instance of SMLR where the number of columns in $Y$ is $\ell = 1$.
Since both $Y$ and $W$ are vectors we recover the more familiar form of this problem: minimize $\|Xw - y\|_2^2$ subject to
$\|w\|_0 \le k$.

[?] analyzed the greedy algorithm for the sparse regression problem. He sets a desired threshold error $E$ and
defines $k$ to be the minimum cardinality of a solution $S^*$ that achieves $f(S^*) \le E' = E/4$. He showed that the
greedy algorithm finds a solution $S$ with $f(S) \le E$ such that

$$|S| \le \left\lceil 9k \cdot \|X^+\|_2^2 \ln\left( \frac{\|y\|_2^2}{E} \right) \right\rceil.$$

In his work [?] implicitly assumes the over constrained setting where the number of columns $n$ in $X$ is smaller than
their dimension $m$ and that $X$ is full rank. In this setting $\alpha = \max_{S'} \|X_{S'}^+\|_2^2 = \|X^+\|_2^2$ by Cauchy's
interlacing theorem.

Here, we apply Theorem 3 with initial solution $S_0 = \emptyset$ (which gives $f(S_0) = \|y\|_2^2$), stopping value
$f_{\mathrm{stop}} = E$ and target value $E' = E/4$. It immediately yields that the greedy algorithm finds a solution of value
$f(S) \le E$ such that

$$|S| \le \left\lceil k \cdot \|X^+\|_2^2 \ln\left( \frac{\|y\|_2^2}{E - E/4} \right) \right\rceil = \left\lceil k \cdot \|X^+\|_2^2 \ln\left( \frac{4\|y\|_2^2}{3E} \right) \right\rceil.$$

This improves the result of [?] in three ways: 1) the approximation factor is smaller by a constant factor, 2) its proof
is more streamlined, and 3) it extends the viability of the greedy algorithm to the under constrained case where the
result of [?] does not hold, specifically, where his implicit assumption that $\max_{S'} \|X_{S'}^+\| = \|X^+\|$ no longer holds.


7 Column Subset Selection Problem 


Given a matrix $X$, Column Subset Selection (CSS) is concerned with finding a small set of columns whose span
captures as much of the Frobenius norm of $X$ as possible. It was thoroughly investigated in the context of numerical
linear algebra [?]. In other words, find a subset $S \subseteq [n]$, $|S| \le k$, of matrix columns that minimizes
$f(S) = \|X - X_S X_S^+ X\|_F^2$. This formulation makes it clear that this is a special case of SMLR where $Y = X$.

[?] investigated the notion of curvature $c \in [0, 1]$ for nonincreasing set functions. They define it as follows:

$$c = 1 - \min_{j \in [n]} \min_{S, T \subseteq [n] \setminus \{j\}} \frac{f(S) - f(S \cup \{j\})}{f(T) - f(T \cup \{j\})}. \tag{11}$$


They show that there exists a greedy type algorithm that finds a solution of value at most $1/(1 - c)$ times the optimal
value of the minimization problem for any objective set function with curvature $c$ (Corollary 8.5 in [?]).
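For very small ground sets the curvature in Equation (11) can be computed by enumeration, as in the Python sketch below. The function name `curvature` and the two toy set functions are assumptions for illustration. Because all marginal gains of a non-increasing function are non-negative, the inner minimum over pairs $(S, T)$ equals the smallest gain of $j$ divided by its largest gain; pairs with a zero denominator are skipped, which is a convention assumed here.

```python
from itertools import combinations

def curvature(f, n, tol=1e-12):
    """Brute-force curvature of a non-increasing set function on {0, ..., n-1}."""
    worst_ratio = 1.0
    for j in range(n):
        others = [i for i in range(n) if i != j]
        # marginal gain of adding j to every subset that does not contain j
        gains = [f(set(c)) - f(set(c) | {j})
                 for r in range(n) for c in combinations(others, r)]
        if max(gains) > tol:
            worst_ratio = min(worst_ratio, min(gains) / max(gains))
    return 1.0 - worst_ratio

# A modular (additive) function has identical gains everywhere: curvature 0.
a = [1.0, 2.0, 3.0]
def modular(S):
    return 10.0 - sum(a[j] for j in S)

# Hypothetical k-median-style costs with a dummy center of cost M.
W = [[1.0, 4.0, 6.0], [5.0, 2.0, 3.0], [7.0, 6.0, 1.0], [2.0, 2.0, 8.0]]
M = 10.0
def f(S):
    return sum(min([M] + [row[j] for j in S]) for row in W)

print(curvature(modular, 3), curvature(f, 3))
```

A curvature close to 1 makes the $1/(1 - c)$ guarantee weak, which is the situation in the second toy function.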

Lemma 7 (Lemma 9.1 from [?]). Let $f(S)$ be the objective function for the Column Subset Selection Problem
corresponding to the matrix $X$. The curvature $c$ of $f(S)$ is such that $\frac{1}{1 - c} \le \kappa^2(X)$, where $\kappa(X)$ is the
condition number of $X$.

Note that for any matrix $X$ with full column rank, if $\tilde{X}$ is the matrix with normalized columns then $\|\tilde{X}^+\| \le \kappa(X)$.
We can find our initial solution $S_0$ by one of the three known methods:

1. an approximation algorithm from [?] finds a solution $S_0$ such that $|S_0| = k$ and performance guarantee $\rho =$ [...];

2. an approximation algorithm from [?] with $|S_0| = k$ and $\rho = k + 1$;

3. an approximation algorithm from [?] with $|S_0| = 2k$ and $\rho = 2$.

Lemma 8. For the column subset selection problem for a column normalized matrix $X$ and $\alpha = \max_{S'} \|X_{S'}^+\|_2^2$ one
can find a set $S$ of value $f(S) \le (1 + \varepsilon) f(S^*)$ such that

$$|S| = O\left( \alpha k \ln \frac{1}{\varepsilon} \right).$$

Proof. Combining one of the above results with the algorithm from Section 3 completes the proof. □